Word, Subword or Character? An Empirical Study of Granularity in Chinese-English NMT
Authors
Abstract
Neural machine translation (NMT), a new approach to machine translation, has been shown to outperform conventional statistical machine translation (SMT) across a variety of language pairs. Translation is an open-vocabulary problem, but most existing NMT systems operate with a fixed vocabulary, which leaves them unable to translate rare words. This problem can be alleviated by using different translation granularities, such as character, subword and hybrid word-character. Translation involving Chinese is one of the most difficult tasks in machine translation, yet to the best of our knowledge no other work has explored which translation granularity is most suitable for Chinese in NMT. In this paper, we conduct an extensive comparison using Chinese-English NMT as a case study and discuss the advantages and disadvantages of the various translation granularities in detail. Our experiments show that the subword model performs best for Chinese-to-English translation when the vocabulary is relatively small, while the hybrid word-character model is most suitable for English-to-Chinese translation. Moreover, experiments across granularities show that a hybrid BPE method achieves the best results on the Chinese-to-English translation task.
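To make the compared granularities concrete, the short Python sketch below (illustrative only, not code from the paper) tokenizes one pre-segmented Chinese sentence at word, character, and hybrid word-character level. The example sentence, the <c> marker for characters produced by splitting rare words, and the toy in-vocabulary set are assumptions made purely for this illustration.

# Illustrative sketch: the same pre-segmented Chinese sentence rendered at the
# granularities compared in the paper. Word segmentation is assumed to have
# been done already by an external segmenter; the in-vocabulary set below is
# hypothetical.

words = ["我们", "讨论", "神经", "机器", "翻译"]   # "we discuss neural machine translation"

# 1. Word granularity: each segmented word is one token.
word_tokens = list(words)

# 2. Character granularity: every Chinese character is its own token.
char_tokens = [ch for w in words for ch in w]

# 3. Hybrid word-character granularity: keep frequent words as single tokens,
#    split rare (out-of-vocabulary) words into characters marked with a tag.
frequent_vocab = {"我们", "机器", "翻译"}          # hypothetical in-vocabulary words

def hybrid_segment(ws, vocab):
    tokens = []
    for w in ws:
        if w in vocab:
            tokens.append(w)
        else:
            # <c> marks characters that came from splitting a rare word
            tokens.extend(f"<c>{ch}" for ch in w)
    return tokens

hybrid_tokens = hybrid_segment(words, frequent_vocab)

print(word_tokens)    # ['我们', '讨论', '神经', '机器', '翻译']
print(char_tokens)    # ['我', '们', '讨', '论', '神', '经', '机', '器', '翻', '译']
print(hybrid_tokens)  # ['我们', '<c>讨', '<c>论', '<c>神', '<c>经', '机器', '翻译']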
Similar Resources
University of Rochester WMT 2017 NMT System Submission
We describe the neural machine translation system submitted by the University of Rochester to the Chinese-English language pair for the WMT 2017 news translation task. We applied unsupervised word and subword segmentation techniques and deep learning in order to address (i) the word segmentation problem caused by the lack of delimiters between words and phrases in Chinese and (ii) the morpholog...
Neural Machine Translation of Rare Words with Subword Units
Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this paper, we introduce a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as seq...
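The byte-pair-encoding (BPE) procedure behind such subword units can be sketched briefly: starting from characters, repeatedly merge the most frequent adjacent symbol pair in the training vocabulary. The Python sketch below is a minimal illustration on a toy English corpus, not the released subword-nmt implementation, and it omits the end-of-word marker that the original method uses.

# Minimal sketch of the BPE idea: repeatedly merge the most frequent adjacent
# symbol pair in a toy corpus. Names and the toy corpus are illustrative only.
from collections import Counter

def learn_bpe(corpus_words, num_merges):
    # Represent each word as a tuple of symbols (characters to start with).
    vocab = Counter(tuple(word) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# Toy corpus: "lower" and "lowest" share the subword "low".
print(learn_bpe(["lower", "lowest", "low", "low"], num_merges=3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]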
A Character-Aware Encoder for Neural Machine Translation
This article proposes a novel character-aware neural machine translation (NMT) model that views the input sequences as sequences of characters rather than words. Using row convolution (Amodei et al., 2015), the encoder of the proposed model automatically composes word-level information from the input sequences of characters. Since our model doesn’t rely on the boundaries between each wo...
Literature Survey: Study of Neural Machine Translation
We build Neural Machine Translation (NMT) systems for English-Hindi, Bengali-Hindi and Gujarati-Hindi with two different units of translation, i.e. word and subword, and present a comparative study of subword NMT and word-level NMT systems, along with strong results and case studies. We train an attention-based encoder-decoder model for word level and use Byte Pair Encoding (BPE) in subword NMT for wo...
Pre-Reordering for Neural Machine Translation: Helpful or Harmful?
Pre-reordering, a preprocessing step that makes the source-side word order close to that of the target side, has been proven very helpful for statistical machine translation (SMT) in improving translation quality. However, is this the case in neural machine translation (NMT)? In this paper, we first investigate the impact of pre-reordered source-side data on NMT, and then propose to incorporate featur...
Journal: CoRR
Volume: abs/1711.04457
Year: 2017